Media: Optimizing the Selection Process
Motion Studios is the largest radio production house in Europe, with total revenue of over $1B. The company has launched a new reality show, "The Star RJ", to find a new Radio Jockey who will be the star presenter on upcoming shows. In the first round, participants upload a voice clip online, and the clip is evaluated by experts for selection into the next round; separate teams evaluate male and female voices.
The response to the show has been unprecedented, and the company is flooded with voice clips. As the ML expert, you have to classify each voice as either male or female so that the first level of filtering is quicker.
Voice samples span a range of accents
The output from the pre-processed WAV files was saved into a CSV file
Approximately 3,000 records
voice-classification.csv
meanfreq: mean frequency (in kHz)
sd: standard deviation of frequency
median: median frequency (in kHz)
Q25: first quartile (in kHz)
Q75: third quartile (in kHz)
IQR: interquartile range (in kHz)
skew: skewness (see note in specprop description)
kurt: kurtosis (see note in specprop description)
sp.ent: spectral entropy
sfm: spectral flatness
mode: mode frequency
centroid: frequency centroid (see specprop)
peakf: peak frequency (frequency with highest energy)
meanfun: average of fundamental frequency measured across acoustic signal
minfun: minimum fundamental frequency measured across acoustic signal
maxfun: maximum fundamental frequency measured across acoustic signal
meandom: average of dominant frequency measured across acoustic signal
mindom: minimum of dominant frequency measured across acoustic signal
maxdom: maximum of dominant frequency measured across acoustic signal
dfrange: range of dominant frequency measured across acoustic signal
modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
label: male or female
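The modindx definition above can be expressed directly in code. The sketch below is illustrative only; `f0` is a hypothetical array of fundamental-frequency measurements, not part of the dataset:

```python
import numpy as np

def modulation_index(f0):
    """Accumulated absolute difference between adjacent fundamental-frequency
    measurements, divided by the frequency range (per the modindx definition)."""
    f0 = np.asarray(f0, dtype=float)
    freq_range = f0.max() - f0.min()
    if freq_range == 0:
        return 0.0
    return np.abs(np.diff(f0)).sum() / freq_range

# Illustrative example with four f0 measurements (in kHz)
print(modulation_index([0.10, 0.15, 0.12, 0.20]))
```

A flat signal (zero frequency range) is treated as having zero modulation to avoid division by zero.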
Since "The Star RJ" is a reality show, the time available to select candidates is very short. The success of the show, and hence its profits, depends upon quick and smooth execution.
# Import all required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Import data
voice = pd.read_csv('voice-classification.csv')
# Check the sample top rows of the data.
voice.head()
# Checking the information of the dataframe
# This will tell us about NULL or missing values in the data.
voice.info()
# Checking the basic statistics of the data for the numeric fields.
voice.describe()
As seen above, the columns skew, kurt, maxdom, and dfrange appear to have outliers, since their mean values are far from their maximum values.
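One way to quantify this impression (a sketch, not part of the original analysis) is the standard 1.5×IQR rule, which counts values lying outside the interquartile fences:

```python
import pandas as pd

def iqr_outlier_counts(df, columns):
    """Count values outside the 1.5*IQR fences for each column
    (a common rule of thumb for flagging outliers)."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return pd.Series(counts)

# In the notebook: iqr_outlier_counts(voice, ['skew', 'kurt', 'maxdom', 'dfrange'])
```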
The target variable 'label' is a character variable and needs to be label-encoded as a number first. Feature scaling should also be performed.
# Importing library for scaling of data.
from sklearn.preprocessing import StandardScaler
# Taking a list of all columns in the dataset.
voice.columns
# Creating separate list of feature variables.
voice_features = ['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx']
# Creating object of standard scaler.
scaler_voice = StandardScaler()
# Scaling all feature variables in the dataset.
voice[voice_features] = scaler_voice.fit_transform(voice[voice_features])
# Checking the data post scaling
voice.head(10)
# Checking data statistics post scaling
voice.describe()
# Now performing label encoding on the target variable.
# Importing library for encoding.
from sklearn.preprocessing import LabelEncoder
# Creating object for label encoding.
le_voice = LabelEncoder()
# Fitting and transforming the target.
voice.label = le_voice.fit_transform(voice.label)
# Finally checking the output dataframe information.
voice.info()
# Checking the sample rows once more.
voice.head(10)
# Checking the descriptive statistics of the dataframe once more as well.
voice.describe()
# Plot to see the distribution of labels in the dataset.
plt.figure(figsize=(8, 5))
sns.countplot(x='label', data=voice)
plt.show()
# Checking pairplots of the entire dataset.
sns.pairplot(voice)
Although it is hard to see clearly because of the number of features involved, very few variables are close to normally distributed; the rest are either left- or right-skewed. Interestingly, a few also show quite strong correlations with each other.
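These pairwise relationships can be quantified with a correlation heatmap. The helper below is a sketch; in the notebook it would be called on the `voice` dataframe and `voice_features` list defined earlier:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_corr_heatmap(df, columns):
    """Draw a correlation heatmap for the given numeric columns and
    return the correlation matrix for inspection."""
    corr = df[columns].corr()
    plt.figure(figsize=(14, 10))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
    plt.title('Feature correlation matrix')
    plt.show()
    return corr

# In the notebook: plot_corr_heatmap(voice, voice_features)
```

Pairs that are related by definition (such as dfrange, which is derived from maxdom and mindom) would be expected to show up as strongly correlated cells.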
At this initial stage, we will create an SVM Model to perform the classification. Later on, we will perform dimensionality reduction to get our features reduced.
# Importing the required libraries.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# Creating dataframe of feature columns.
X = voice[['meanfreq', 'sd', 'median', 'Q25', 'Q75', 'IQR', 'skew', 'kurt',
'sp.ent', 'sfm', 'mode', 'centroid', 'meanfun', 'minfun', 'maxfun',
'meandom', 'mindom', 'maxdom', 'dfrange', 'modindx']]
# Creating dataframe of target variable.
y = voice.label
# Creating a train-test split of the data.
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42)
# Creating object of SVM Classifier.
initial_voice_clf = SVC()
# fitting the training data.
initial_voice_clf.fit(x_train, y_train)
# Getting the predictions.
init_y_pred = initial_voice_clf.predict(x_test)
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",round(metrics.accuracy_score(y_test, init_y_pred)*100,2))
# Model Precision: what percentage of positive tuples are labeled as such?
print("Precision:",metrics.precision_score(y_test, init_y_pred))
# Model Recall: what percentage of positive tuples are labelled as such?
print("Recall:",metrics.recall_score(y_test, init_y_pred))
The initial model built above, without any tuned parameters, produces very good results overall: accuracy is approximately 98%, and precision and recall are also quite good. To validate the model and improve it further, we should tune its hyperparameters.
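Precision and recall can also be combined into a single F1 score, and a confusion matrix shows exactly where the errors fall. This is a small sketch to complement the metrics above; in the notebook it would be called with `y_test` and `init_y_pred`:

```python
from sklearn.metrics import f1_score, confusion_matrix

def summarize_classifier(y_true, y_pred):
    """Print the F1 score and the confusion matrix
    (rows: actual class, columns: predicted class)."""
    print("F1 score:", round(f1_score(y_true, y_pred), 4))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# In the notebook: summarize_classifier(y_test, init_y_pred)
```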
Before applying PCA to reduce the features in the dataset, we will use grid search cross-validation to find the optimal values of gamma and C, the hyperparameters of our SVM classifier.
Once we find this, we will tune our model accordingly, and then later on, proceed with PCA.
# Importing required libraries.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# as a start, creating a K-Fold object with 5 splits
folds = KFold(n_splits = 5, shuffle = True, random_state = 101)
# firstly, we will specify range of hyperparameters - later, we will set the parameters by cross-validation
hyper_params = [ {'gamma': [1e-2, 1e-3],
'C': [1, 10, 100]}]
# specify model with RBF kernel.
rbf_voice_clf = SVC(kernel="rbf")
# set up GridSearchCV()
rbf_voice_clf_cv = GridSearchCV(estimator = rbf_voice_clf,
param_grid = hyper_params,
scoring= 'accuracy',
cv = folds,
verbose = 1,
return_train_score=True,
n_jobs=-1)
# fit the model
rbf_voice_clf_cv.fit(x_train, y_train)
# cv results
cv_results = pd.DataFrame(rbf_voice_clf_cv.cv_results_)
cv_results
# converting C to numeric type for plotting on x-axis
cv_results['param_C'] = cv_results['param_C'].astype('int')
# plotting
plt.figure(figsize=(20,10))
# subplot 1/2
plt.subplot(121)
gamma_01 = cv_results[cv_results['param_gamma']==0.01]
plt.plot(gamma_01["param_C"], gamma_01["mean_test_score"])
plt.plot(gamma_01["param_C"], gamma_01["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.01")
plt.ylim([0.70, 1.2])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
# subplot 2/2
plt.subplot(122)
gamma_001 = cv_results[cv_results['param_gamma']==0.001]
plt.plot(gamma_001["param_C"], gamma_001["mean_test_score"])
plt.plot(gamma_001["param_C"], gamma_001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.001")
plt.ylim([0.70, 1.2])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
# printing the optimal accuracy score and hyperparameters
best_score = rbf_voice_clf_cv.best_score_
best_hyperparams = rbf_voice_clf_cv.best_params_
print("The best test score is {0} corresponding to hyperparameters {1}".format(best_score, best_hyperparams))
As seen above, using Grid Search Cross Validation method, we find that C should be = 10 and gamma should be 0.01 with RBF Kernel for our model to produce a 98% accuracy.
# model with the optimal hyperparameters found by the grid search
# (using best_params_ directly avoids hard-coding mismatched values)
rbf_voice_final = SVC(kernel="rbf", **rbf_voice_clf_cv.best_params_)
# fit the model
rbf_voice_final.fit(x_train, y_train)
# predict using test data.
y_pred_rbf = rbf_voice_final.predict(x_test)
# metrics of test data.
print("Accuracy of our model : ", metrics.accuracy_score(y_test, y_pred_rbf), "\n")
Finally, let's perform PCA (Principal Component Analysis) now, to find the optimal parameters and also to improve our model. The aim is not to increase the accuracy, but to ensure that our model is more stable.
# Import PCA library in SKLearn
from sklearn.decomposition import PCA
# Make an instance of the Model
pca_voice = PCA(n_components=0.95)
# Fitting the Training and Testing data on above PCA.
pca_voice.fit(x_train)
x_train_pca = pca_voice.transform(x_train)
x_test_pca = pca_voice.transform(x_test)
# print the number of components selected by PCA.
print("No Of Components selected via PCA: ", pca_voice.n_components_)
# model with optimal hyperparameters
# create the model
model_voice_pca = SVC(C=10, gamma=0.001, kernel="rbf")
# fit the model
model_voice_pca.fit(x_train_pca, y_train)
# predict using test data.
y_pred_pca = model_voice_pca.predict(x_test_pca)
# metrics of test data.
print("Accuracy of our model post PCA : ", metrics.accuracy_score(y_test, y_pred_pca), "\n")
Here we find that with 10 components from the dataset, our SVM classifier achieves approximately 97% accuracy. Finally, let's find the features that explain most of the variance in our data, as identified by the PCA.
# Checking the explained variance ratio as found in PCA
pca_voice.explained_variance_ratio_
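The ratios above can be accumulated to see how quickly the components reach the 95% variance threshold. The helper below is a sketch; in the notebook it would be called on `pca_voice.explained_variance_ratio_`:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cumulative_variance(explained_ratios, threshold=0.95):
    """Plot cumulative explained variance and return the number of
    components needed to reach the threshold."""
    cum = np.cumsum(explained_ratios)
    n_needed = int(np.searchsorted(cum, threshold) + 1)
    plt.plot(range(1, len(cum) + 1), cum, marker='o')
    plt.axhline(threshold, linestyle='--')
    plt.xlabel('Number of components')
    plt.ylabel('Cumulative explained variance')
    plt.show()
    return n_needed

# In the notebook: plot_cumulative_variance(pca_voice.explained_variance_ratio_)
```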
# Taking the count of PCA components.
n_pcs= pca_voice.components_.shape[0]
n_pcs
# get the index of the most important feature on EACH component i.e. largest absolute value
most_important = [np.abs(pca_voice.components_[i]).argmax() for i in range(n_pcs)]
most_important
# getting most important features.
most_important_names = [voice_features[most_important[i]] for i in range(n_pcs)]
# creating the dictionary for getting the list of important features.
dic_voice_imp_fields = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
# building a dataframe to get list of columns
df_voice_imp_fields = pd.DataFrame(sorted(dic_voice_imp_fields.items()))
df_voice_imp_fields
As seen above, the variables centroid, mode, sp.ent, kurt, Q75, meanfun, minfun, mindom, modindx, and maxfun are the most important features in the dataset for deciding whether a voice is male or female.